Introduction to Open Data Science - Course Project

About the project

2: Regression and model validation

Describe the work you have done this week and summarize your learning.

2.1 Data wrangling

After understanding the metadata file and skipping the unnecessary Finnish-language instructions, it was easy to prepare the data file. The script can be found here. I used plyr and tidyverse to convert the raw data into a workable file.

2.2 Analysis

The dataset is the result of a survey on teaching and learning that was conducted as a research project in 2014. The dataset is loaded in the next chunk, and its dimensions and structure are printed.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
learning2014_csv <- read_csv("/home/alex/ods_course/IODS-project_R/data/learning2014_AA.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014_csv)
## [1] 166   7
str(learning2014_csv)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The data frame contains 166 rows and 7 columns. The first column encodes gender and has the character type. The rest are all numeric and contain the data that will be used for the regression analysis. The “attitude”, “deep”, “stra”, and “surf” columns contain composite scores computed from the original dataframe. The full metadata explanation can be found here. The explanation for the study can be found here.

Briefly:

  • attitude: Global attitude toward statistics, a composite score of the questions concerning the global attitude toward statistics
  • deep: Deep approach, an adjusted composite score of the questions connected with deep understanding of statistics
  • stra: Strategic approach, an adjusted composite score of the questions connected with how strategically the participant approaches the subject
  • surf: Surface approach, an adjusted composite score of the questions connected with whether the participant engages with the material only superficially and has problems studying.
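These composites can be reproduced as row means of their component Likert items. A minimal sketch in base R — the item names D03, D11 and D19 below are hypothetical placeholders, not the actual questionnaire codes:

```r
# Sketch: forming a composite score as the mean of its component items.
# D03, D11 and D19 are hypothetical Likert items on a 1-5 scale.
answers <- data.frame(
  D03 = c(4, 3),
  D11 = c(5, 2),
  D19 = c(4, 4)
)

# The composite keeps the original 1-5 scale because it is a mean, not a sum.
answers$deep <- rowMeans(answers[, c("D03", "D11", "D19")])
answers$deep
```

The real script does the same per approach (deep, stra, surf) over the column groups listed in the metadata.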

2.2.2 Data exploration

2.2.2.1 Summary for the variables

summaries <- learning2014_csv %>%
  summarise(across(where(is.numeric), list(mean = ~mean(.), sd = ~sd(.), min = ~min(.), max = ~max(.), median = ~median(.))))

print(t(summaries))
##                       [,1]
## age_mean        25.5120482
## age_sd           7.7660785
## age_min         17.0000000
## age_max         55.0000000
## age_median      22.0000000
## attitude_mean    3.1427711
## attitude_sd      0.7299079
## attitude_min     1.4000000
## attitude_max     5.0000000
## attitude_median  3.2000000
## deep_mean        3.6797189
## deep_sd          0.5541369
## deep_min         1.5833333
## deep_max         4.9166667
## deep_median      3.6666667
## stra_mean        3.1212349
## stra_sd          0.7718318
## stra_min         1.2500000
## stra_max         5.0000000
## stra_median      3.1875000
## surf_mean        2.7871486
## surf_sd          0.5288405
## surf_min         1.5833333
## surf_max         4.3333333
## surf_median      2.8333333
## points_mean     22.7168675
## points_sd        5.8948836
## points_min       7.0000000
## points_max      33.0000000
## points_median   23.0000000

2.2.2.2 Graphical data summary

ggpairs(learning2014_csv, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

Description

  1. The gender distribution is uneven
  2. The age distribution is also skewed toward younger participants, with only a few older participants of each gender
  3. No other variable shows stark clustering behaviour
  4. There are significant correlations between variables
    • Negative:
      • surf/deep - these are expected to be inversely correlated, as are surf/attitude and surf/stra
      • age/surf appear to be negatively correlated, but the p-value is between 0.05 and 0.1, as it is for surf/points.
    • Positive:
      • attitude/points, stra/points

These results make sense at first glance.
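Any of the correlation/p-value pairs reported by ggpairs can be checked individually with cor.test. A quick illustration on the built-in mtcars data (so it runs standalone); the same call works on any two numeric columns, e.g. cor.test(learning2014_csv$surf, learning2014_csv$deep):

```r
# cor.test reports both the Pearson correlation estimate and its p-value.
ct <- cor.test(mtcars$wt, mtcars$mpg)
unname(ct$estimate)   # strong negative correlation
ct$p.value            # far below 0.05
```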

2.2.3 Linear regression

I have chosen three variables to check: attitude, age and surf.

linear_modelling_for_learning2014_csv  <- lm(points ~ surf + attitude + age, data = learning2014_csv) 
summary(linear_modelling_for_learning2014_csv)
## 
## Call:
## lm(formula = points ~ surf + attitude + age, data = learning2014_csv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.973  -3.430   0.167   3.997  10.444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.85628    3.53786   4.765 4.18e-06 ***
## surf        -0.96049    0.79943  -1.201    0.231    
## attitude     3.42388    0.57353   5.970 1.45e-08 ***
## age         -0.08713    0.05361  -1.625    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.294 on 162 degrees of freedom
## Multiple R-squared:  0.2082, Adjusted R-squared:  0.1935 
## F-statistic:  14.2 on 3 and 162 DF,  p-value: 2.921e-08

Only one explanatory variable showed a statistically significant association with the points outcome, so only the attitude variable will be kept for the next modelling step.

linear_modelling_for_learning2014_csv_pruned  <-lm(points ~ attitude, data = learning2014_csv) 
summary(linear_modelling_for_learning2014_csv_pruned)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014_csv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The second model is simpler but has not lost much of its explanatory power:

  • The p-value is comparably low, much lower than 0.001, meaning there is strong support for rejecting the hypothesis that the two variables have no connection.
  • The estimate shows how the two variables are connected: for a one-unit change in attitude, the predicted points change by 3.5255.
  • The adjusted R-squared of 0.1856 is relatively low, indicating that the model may not be very effective in predicting or explaining the dependent variable. This might be due to various reasons such as missing important predictors, non-linear relationships, or high levels of noise in the data.
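As a quick sanity check of the coefficients, the fitted line can be applied by hand, using the estimates exactly as printed in the summary above:

```r
# points ~ intercept + slope * attitude, with the printed estimates.
intercept <- 11.6372
slope     <- 3.5255

# Predicted exam points for a student at the median attitude (3.2):
predicted <- intercept + slope * 3.2
round(predicted, 1)   # close to the points median of 23 reported earlier
```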

2.2.3.1 Linear model diagnostic plots

plot(linear_modelling_for_learning2014_csv_pruned, which=c(1,2,5))

Using these plots we can investigate the assumptions of the model.

  1. The “Residuals vs Fitted” plot can be used to investigate the constant variance assumption. If the variance is constant, we should expect to see points distributed without noticeable structure, i.e. randomly. This is more or less what we see in our data. Although the points around 28 on the x axis seem bunched up, this may be the result of low n-numbers, so we can assume constant variance.
  2. The Q-Q plot can be used to check whether the “normal distribution of residuals” assumption is met. The middle part of the plot follows the 45-degree line, meaning that the distribution is close to normal for the bulk of the data. The tails of the plot do deviate from the line, meaning there is some departure from perfect normality. At the bottom left there are 3 points that can be considered outliers. Further investigation is needed to determine whether these points should be excluded from further analysis.
  3. The “Residuals vs Leverage” plot shows whether the data contains influential outliers. Three points are highlighted: 71, 56 and 35. These should be investigated to determine whether the data is fine; for example, there could be a data entry mistake, a missing value problem, or a true outlier. Points 56 and 35 appear on both the Q-Q and the residuals vs leverage plots, so the next step would be checking the data to see what happened to these points.
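The flagged rows can also be pulled out programmatically rather than read off the plots. A sketch using standardised residuals, shown on built-in data so it runs standalone; the same pattern applies to linear_modelling_for_learning2014_csv_pruned:

```r
# Observations whose standardised residuals fall beyond +-2 are candidates
# for a closer look (data entry mistakes, missing values, or true outliers).
fit <- lm(mpg ~ wt, data = mtcars)
flagged <- which(abs(rstandard(fit)) > 2)
mtcars[flagged, c("wt", "mpg")]
```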

3: Logistic regression

Necessary packages

if (!require(tidyverse)) {
  install.packages("tidyverse")
}
library(tidyverse)


if (!require(gt)) {
  install.packages("gt")
}
## Loading required package: gt
library(gt)

3.1 Data wrangling

The dataframes were merged and modified according to the instructions. The resulting dataframe contains 370 observations with 35 columns. The R script used is here.
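The core of that merge can be sketched on toy data (hedged: the toy id column stands in for the full set of identifier columns used in the course instructions):

```r
# Students present in both source files are kept (inner join via merge),
# and numeric columns that exist in both files are averaged.
mat <- data.frame(id = c(1, 2, 3), absences = c(4, 0, 2))
por <- data.frame(id = c(2, 3, 4), absences = c(2, 2, 6))

both <- merge(mat, por, by = "id", suffixes = c("_mat", "_por"))
both$absences <- rowMeans(both[, c("absences_mat", "absences_por")])
both[, c("id", "absences")]
```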

3.2 Read the file and describe the data a bit

aa_alc <- read_csv("/home/alex/ods_course/IODS-project_R/data/aa_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(aa_alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school     : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex        : chr [1:370] "F" "F" "F" "F" ...
##  $ age        : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ address    : chr [1:370] "U" "U" "U" "U" ...
##  $ famsize    : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus    : chr [1:370] "A" "T" "T" "T" ...
##  $ Medu       : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu       : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob       : chr [1:370] "at_home" "at_home" "at_home" "health" ...
##  $ Fjob       : chr [1:370] "teacher" "other" "other" "services" ...
##  $ reason     : chr [1:370] "course" "course" "other" "home" ...
##  $ guardian   : chr [1:370] "mother" "father" "mother" "mother" ...
##  $ traveltime : num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime  : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup  : chr [1:370] "yes" "no" "yes" "no" ...
##  $ famsup     : chr [1:370] "no" "yes" "no" "yes" ...
##  $ activities : chr [1:370] "no" "no" "no" "yes" ...
##  $ nursery    : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher     : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet   : chr [1:370] "no" "yes" "yes" "yes" ...
##  $ romantic   : chr [1:370] "no" "no" "no" "yes" ...
##  $ famrel     : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime   : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout      : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc       : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc       : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
##  $ health     : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
##  $ failures   : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
##  $ paid       : chr [1:370] "no" "no" "yes" "yes" ...
##  $ absences   : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
##  $ G1         : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
##  $ G2         : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
##  $ G3         : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
##  $ average_alc: num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use   : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   paid = col_character(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   average_alc = col_double(),
##   ..   high_use = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>
colnames(aa_alc)
##  [1] "school"      "sex"         "age"         "address"     "famsize"    
##  [6] "Pstatus"     "Medu"        "Fedu"        "Mjob"        "Fjob"       
## [11] "reason"      "guardian"    "traveltime"  "studytime"   "schoolsup"  
## [16] "famsup"      "activities"  "nursery"     "higher"      "internet"   
## [21] "romantic"    "famrel"      "freetime"    "goout"       "Dalc"       
## [26] "Walc"        "health"      "failures"    "paid"        "absences"   
## [31] "G1"          "G2"          "G3"          "average_alc" "high_use"

The data describe student achievement in secondary education at two Portuguese schools; the dataset contains 370 observations of 35 variables. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. The different features can be used to analyse the dataset and make predictions. More detailed information can be found here

  1. “school”: The name or type of school the student attends.

  2. “sex”: The student’s gender.

  3. “age”: The student’s age.

  4. “address”: The student’s home address or type of locality (urban/rural).

  5. “famsize”: The size of the student’s family.

  6. “Pstatus”: The marital status of the student’s parents.

  7. “Medu”: The mother’s education level.

  8. “Fedu”: The father’s education level.

  9. “Mjob”: The mother’s job.

  10. “Fjob”: The father’s job.

  11. “reason”: The reason for choosing this school.

  12. “guardian”: The student’s primary guardian.

  13. “traveltime”: Time taken to travel to school.

  14. “studytime”: Time spent on studying outside of school.

  15. “schoolsup”: Whether the student receives school support.

  16. “famsup”: Whether the student receives family support.

  17. “activities”: Participation in extracurricular activities.

  18. “nursery”: Whether the student attended nursery school.

  19. “higher”: Aspirations for higher education.

  20. “internet”: Access to the internet at home.

  21. “romantic”: Involvement in a romantic relationship.

  22. “famrel”: Quality of family relationships.

  23. “freetime”: Amount of free time after school.

  24. “goout”: Frequency of going out with friends.

  25. “Dalc”: Daily alcohol consumption.

  26. “Walc”: Weekly alcohol consumption.

  27. “health”: General health status.

  28. “failures”: Number of past class failures.

  29. “paid”: Whether the student is enrolled in extra paid classes.

  30. “absences”: Number of school absences.

  31. “G1”: Grade in the first period.

  32. “G2”: Grade in the second period.

  33. “G3”: Final grade.

  34. “average_alc”: Average alcohol consumption (perhaps a computed variable from Dalc and Walc).

  35. “high_use”: Indicator of high alcohol use (likely a derived variable).
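The last two are derived variables; per the course instructions they are most likely computed as below (a sketch, checked against the first rows of the printed data):

```r
# average_alc: mean of workday (Dalc) and weekend (Walc) consumption;
# high_use: TRUE when that average exceeds 2.
Dalc <- c(1, 1, 2)   # first three rows of the printed data
Walc <- c(1, 1, 3)

average_alc <- (Dalc + Walc) / 2   # 1.0 1.0 2.5, matching the str() output
high_use    <- average_alc > 2     # FALSE FALSE TRUE
```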

3.3 Choosing the variable

I decided to look at “failures”, “absences”, “freetime” and “famrel”. I would expect that:

  1. Failures might be positively associated with higher consumption
  2. Absences might also be positively associated with higher consumption
  3. Free time may be associated with weekly consumption, though it is possible that the association will be inverse.
  4. Family relations may be inversely correlated.

3.4 Testing the assumptions:

summaries <- aa_alc %>% group_by(high_use) %>% select(.,c("failures", "absences", "freetime", "famrel","high_use")) %>% 
  summarise(across(where(is.numeric), list(Average = ~mean(.), sd = ~sd(.))))
gt(summaries) %>% cols_align("left") %>% opt_stylize(color="gray", style=3)  %>% tab_header("Table for chosen variables")
Table for chosen variables

high_use  failures_Average  failures_sd  absences_Average  absences_sd  freetime_Average  freetime_sd  famrel_Average  famrel_sd
FALSE     0.1196911         0.4370868    3.710425          4.464017     3.119691          0.9869096    4.007722        0.8935266
TRUE      0.3513514         0.7464906    6.378378          7.060845     3.468468          0.9421428    3.765766        0.9337603

These are the averages and standard deviations for all the variables, shown to better see which ones are worth a closer look.

First, looking at the overall structure to see whether something connects the chosen variables.

ggpairs(select(aa_alc,c("failures", "absences", "freetime", "famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

As it turns out, free time is strongly correlated with family relations, which I did not expect.

Failures

ggpairs(select(aa_alc,c("failures","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The number of failures is slightly higher in the high-consumption group, but not by much. The overwhelming majority of participants in both groups have 0 failures. This was expected.

Abscences

ggpairs(select(aa_alc,c("absences","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The number of absences in the high-use group is higher, as expected. The standard deviation is also much higher in this group, suggesting more variability in absences.

Freetime

ggpairs(select(aa_alc,c("freetime","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The freetime variable has a different mean but a very similar sd: the high-consumption group has a slightly higher average, which is the relation I was curious about.

Family relation

ggpairs(select(aa_alc,c("famrel","high_use")), mapping = aes(), lower = list(combo = wrap("facethist", bins = 20))) + theme_minimal()

The family relation variable seems to be lower in the high consumption group, as expected.

3.5 Using logistic regression

first_log_regression <- glm(high_use ~ failures + absences + freetime + famrel, data = aa_alc, family = "binomial")

# create model summary and extract the coeffs
summary(first_log_regression)
## 
## Call:
## glm(formula = high_use ~ failures + absences + freetime + famrel, 
##     family = "binomial", data = aa_alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.55560    0.63703  -2.442 0.014608 *  
## failures     0.58297    0.20259   2.878 0.004007 ** 
## absences     0.08375    0.02230   3.755 0.000173 ***
## freetime     0.43091    0.12804   3.365 0.000764 ***
## famrel      -0.31981    0.13098  -2.442 0.014622 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 408.17  on 365  degrees of freedom
## AIC: 418.17
## 
## Number of Fisher Scoring iterations: 4

From the model we can see the following:

  • Higher values of the failures, absences and freetime predictors are associated with a higher likelihood of high_use being TRUE; for family relations the opposite is true.
  • All predictors appear to be statistically significant, as indicated by their p-values and significance codes
  • The reduction in deviance from null to residual suggests the model with predictors fits better than the null model - the one without any predictor variables.
coeffs <- summary(first_log_regression) %>% coefficients()
coeffs
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.55560199 0.63703172 -2.441954 0.0146080178
## failures     0.58296988 0.20258843  2.877607 0.0040070412
## absences     0.08375499 0.02230317  3.755295 0.0001731376
## freetime     0.43090906 0.12804024  3.365419 0.0007642749
## famrel      -0.31981102 0.13098388 -2.441606 0.0146221015
OddsRatio <- exp(coeffs)
confidence <- confint(first_log_regression)
## Waiting for profiling to be done...
result <- cbind(OddsRatio, confidence)
result
##              Estimate Std. Error     z value Pr(>|z|)      2.5 %      97.5 %
## (Intercept) 0.2110623   1.890860  0.08699073 1.014715 -2.8317664 -0.32620631
## failures    1.7913506   1.224568 17.77169287 1.004015  0.1917587  0.99258448
## absences    1.0873624   1.022554 42.74681512 1.000173  0.0420769  0.13014453
## freetime    1.5386556   1.136599 28.94562433 1.000765  0.1840339  0.68715409
## famrel      0.7262863   1.139949  0.08702100 1.014730 -0.5789180 -0.06341917

Here we can see the same: the first three predictors have a positive association with consumption, and the family relations variable has a negative association. The strongest connection is with failures, then freetime and famrel, while the link to absences is weak.

It is important to keep in mind that this is logistic regression, not the previously investigated linear regression. That means the estimates in this table represent odds ratios and should be thought of in terms of likelihood. For example, in our case an increase in failures by one unit multiplies the odds of high alcohol consumption by about 1.8.
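The odds-ratio arithmetic can be verified directly from the coefficient table, using the printed estimate for failures:

```r
# Coefficients are on the log-odds scale; exponentiating gives the odds ratio.
failures_coef <- 0.58297       # estimate from the model summary
odds_ratio <- exp(failures_coef)
round(odds_ratio, 2)           # about 1.79: each extra failure multiplies
                               # the odds of high_use by roughly 1.8
```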

The results are in agreement with the hypotheses I started with. Interestingly, there seems to be a connection between free time and consumption levels, which was not obvious when just looking at the data.

3.6 Exploring the predictive power

We can test the predictive power of the model that we created earlier to see whether it can actually predict alcohol consumption based on the four chosen variables. We can predict, for each student, the probability of increased consumption.

aa_alc$predicted_probabilities <- predict(first_log_regression, type = "response")  

aa_alc <- mutate(aa_alc,prediction  = predicted_probabilities > 0.5) 

table(Actual = aa_alc$high_use, Predicted = aa_alc$prediction)
##        Predicted
## Actual  FALSE TRUE
##   FALSE   236   23
##   TRUE     85   26

These are the 2x2 cross tabulation count results.

((table(Actual_percentage = aa_alc$high_use, Predicted_percentage = aa_alc$prediction))/length(aa_alc$prediction)) * 100
##                  Predicted_percentage
## Actual_percentage     FALSE      TRUE
##             FALSE 63.783784  6.216216
##             TRUE  22.972973  7.027027

The False outcome was correctly predicted for 236 participants (63.8%), and True was correctly predicted for 26 (7.0%).
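The usual classification metrics follow directly from that 2x2 table (a sketch using the printed counts):

```r
# Counts read off the cross-tabulation above.
TN <- 236; FP <- 23   # actual FALSE: predicted FALSE / predicted TRUE
FN <- 85;  TP <- 26   # actual TRUE:  predicted FALSE / predicted TRUE

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # ~0.71
sensitivity <- TP / (TP + FN)                   # ~0.23: most high users are missed
specificity <- TN / (TN + FP)                   # ~0.91
c(accuracy, sensitivity, specificity)
```

The low sensitivity shows that the 0.5 cutoff favours the majority (FALSE) class.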

library(ggplot2)
ggplot(aa_alc, aes(x=high_use, y=prediction)) +
  geom_jitter(color="blue") +
  theme_minimal() +
  labs(title="Actual vs Predicted Alcohol Consumption")

confusion_matrix <- table(Predicted = aa_alc$prediction,Actual = aa_alc$high_use)
training_error <- 1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)
training_error
## [1] 0.2918919

The total training error was 29%.

3.6.2 Comparing to simple guessing strategy

We can compare our model to a simple guessing strategy - always predicting the most common class.

First we determine which is the bigger group:

sum(aa_alc$high_use)/length(aa_alc$high_use)
## [1] 0.3

About 30% of the participants are high consumers, so the more prevalent group is the low-consumption one.

simple_guess_accuracy <- mean(aa_alc$high_use == F)
model_accuracy <- mean(aa_alc$high_use == aa_alc$prediction)
simple_guess_accuracy
## [1] 0.7
model_accuracy
## [1] 0.7081081

The accuracy of the model is marginally better than guessing.

3.7 Cross-validating the model

We can perform a cross-validation of the model, meaning that we will test the model's performance on subsets of the same data to determine how accurately the chosen model can work in practice. For that we will use the cv.glm function from the boot library. The idea is to test how well the model predicts the status using ten different subsets of the data in sequence.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
library(boot)

cv <- cv.glm(data = aa_alc, cost = loss_func, glmfit = first_log_regression, K = 10)

cv$delta[1]
## [1] 0.2972973

The prediction error of my model is about 0.29, compared to 0.26, which is worse. I used 4 predictors and did not include the sex predictor in my model, which might be important. Additionally, two of my chosen factors were quite strongly correlated, so one of them might not add much to the model.

3.8 Exploring the effect of the number of predictors

We can try using all the predictors to see how the number of predictors influences the model's performance. For this we exclude all the predictors that were used to create the outcome variable.

First we load additional libraries

if (!require(caret)) {
  install.packages("caret")
}
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(caret)

if (!require(glmnet)) {
  install.packages("glmnet")
}
## Loading required package: glmnet
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
library(glmnet)
predictors <- setdiff(colnames(aa_alc), c("Dalc","Walc","average_alc","high_use","predicted_probabilities","prediction" ))
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()

aa_alc$high_use <- as.factor(aa_alc$high_use)

# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
    # Select a subset of predictors
    current_predictors <- predictors[1:i]
    formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
    set.seed(123)  # for reproducibility
    fitControl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
    model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
    
    # Store training and testing errors
    training_errors <- c(training_errors, mean(model$results$Accuracy))
    testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))# Replace 'Accuracy' with the appropriate metric
    number_of_predictors <- c(number_of_predictors, i)
    model_summaries[[i]] <- model$coefnames
}

results_df <- data.frame(
    Number_of_Predictors = number_of_predictors,
    Training_Error = training_errors,
    Testing_Error = testing_errors
)

# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
    geom_line(aes(y = Training_Error, colour = "Training Error")) +
    geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
    labs(title = "Training and Testing Errors by Number of Predictors",
         x = "Number of Predictors",
         y = "Error") +
    theme_minimal()

This shows that increasing the number of predictors makes the model worse; however, there is a jump in model performance at the 26th point in the plot. If we look at the predictor first present in the 26th point:

setdiff(model_summaries[[26]],model_summaries[[25]])
## [1] "failures"

As we know, failures was a strong predictor. So, to check how the plot looks if we start with the better predictors, I reversed the predictors vector.

predictors  <- rev(predictors)
# Initialize vectors to store errors
training_errors <- c()
testing_errors <- c()
number_of_predictors <- c()
model_summaries <- list()

# Loop over models with decreasing number of predictors
for (i in length(predictors):1) {
    # Select a subset of predictors
    current_predictors <- predictors[1:i]
    formula <- as.formula(paste("high_use ~", paste(current_predictors, collapse = "+")))
    set.seed(123)  # for reproducibility
    fitControl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
    model <- train(formula, data = aa_alc, method = "glm", family = "binomial", trControl = fitControl)
    
    # Store training and testing errors
    training_errors <- c(training_errors, mean(model$results$Accuracy))
    testing_errors <- c(testing_errors, 1 - max(model$results$Accuracy))# Replace 'Accuracy' with the appropriate metric
    number_of_predictors <- c(number_of_predictors, i)
    model_summaries[[i]] <- model$coefnames
}

results_df <- data.frame(
    Number_of_Predictors = number_of_predictors,
    Training_Error = training_errors,
    Testing_Error = testing_errors
)

# Plotting
ggplot(results_df, aes(x = Number_of_Predictors)) +
    geom_line(aes(y = Training_Error, colour = "Training Error")) +
    geom_line(aes(y = Testing_Error, colour = "Testing Error")) +
    labs(title = "Training and Testing Errors by Number of Predictors",
         x = "Number of Predictors",
         y = "Error") +
    theme_minimal()

That makes the model better overall, with the minimum value being:

## [1] 0.2454599

This minimum is reached with the addition of the guardian variable.

predictors[20]
## [1] "guardian"

This shows that simply increasing the number of predictors negatively influences the model, but adding valuable predictors improves it.


4: Clustering and classification

4.1.1: Loading necessary libraries

library(MASS)  # For the Boston dataset
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)  # For graphical representations
library(caret)  # For data splitting and preprocessing
library(corrplot)
## corrplot 0.92 loaded

4.1.2: Loading the data and investigating the data structure

# Step 1: Load and Explore the Boston Dataset
data("Boston")
# Exploring the structure and dimensions of the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

This dataset contains housing values in the suburbs of Boston. It has 506 rows and 14 columns, all numerical; chas can be considered categorical as it takes only the values 0 and 1. Each row represents a suburb, and the columns are features such as crime rate, number of rooms, and age of the housing. A more complete description can be found here

The variables have the following descriptions:

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS - proportion of non-retail business acres per town.
  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940
  • DIS - weighted distances to five Boston employment centres
  • RAD - index of accessibility to radial highways
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town
  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - % lower status of the population
  • MEDV - Median value of owner-occupied homes in $1000’s

4.1.3: Graphical Overview and Summary of Variables

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary statistics of the variables are not directly comparable, as is expected for real-world data. We can now examine the relations between the variables to look deeper into the data.

  • The distributions are skewed: nox and dis are skewed towards low values, age towards high values. The proportion of non-retail business acres per town (indus) has a bimodal distribution, as does the tax variable.

  • Almost all the variables appear to be correlated. Only 8 pairs are not significantly correlated, all involving the “categorical variable” Charles River. Most of the correlations make sense: e.g. nox concentration and the distance to the employment centres are negatively correlated, while nox and industrialisation are positively correlated.

  • Crime rate appears to have a statistically significant correlation with all variables except chas. It correlates negatively with 5 variables and positively with 7 variables.
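The significance of these pairwise correlations can be checked with `cor.test`; for instance, the p-values for crim against each of the other variables:

```r
# Pearson correlation test p-values between crim and every other variable
sapply(setdiff(names(Boston), "crim"), function(v) {
  cor.test(Boston$crim, Boston[[v]])$p.value
})
```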

We can plot the correlation in a more visually pleasing way. All the relations between the variables can be seen clearly on this figure

cor_matrix <- cor(Boston) 

corrplot(cor_matrix)


4.1.4: Standardising the dataset

As the data describe very different phenomena, the values are not directly comparable. E.g. the mean value of tax was 408, while that of nox was ~0.55. To eliminate this we can standardise the data.

scaled_Boston <- as.data.frame(scale(Boston))
summary(scaled_Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Now all variables have mean 0 and unit standard deviation. This standardization makes the variables comparable and often improves machine learning model performance.
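`scale()` applies the usual z-score transformation column by column; the equivalence is easy to verify for a single variable:

```r
# scale(x) computes (x - mean(x)) / sd(x) for each column
manual <- (Boston$tax - mean(Boston$tax)) / sd(Boston$tax)
all.equal(manual, as.numeric(scale(Boston$tax)))  # TRUE
```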

4.1.4.1: Creating a categorical variable from scaled crime rate

Using quantiles as breakpoints, and removing the old crime rate variable.

quantiles <- quantile(scaled_Boston[, "crim"], probs = c(0, 0.25, 0.5, 0.75, 1))
scaled_Boston$categorical_crim <- cut(scaled_Boston[, "crim"], breaks = quantiles, 
                               include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
scaled_Boston <- scaled_Boston[, -which(names(scaled_Boston) == "crim")]

Splitting the Dataset into Train and Test Sets

80% of the data is now in the training set, and the remaining 20% is in the test set.

set.seed(123) # For reproducibility
indexes <- createDataPartition(scaled_Boston$categorical_crim, p = 0.8, list = FALSE)
train_set <- scaled_Boston[indexes, ]
test_set <- scaled_Boston[-indexes, ]

4.1.5: Fitting the linear discriminant analysis on the train set

# Loading necessary library
library(MASS)  # For lda function
library(ggplot2)  # For creating biplot

# Step 1: Fit the Linear Discriminant Analysis Model
# Fitting LDA with categorical crime rate as target variable
lda_model <- lda(categorical_crim ~ ., data = train_set)

# Step 2: Summary of the LDA Model
lda_model
## Call:
## lda(categorical_crim ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2512315 0.2487685 0.2487685 0.2512315 
## 
## Group means:
##                  zn      indus        chas        nox         rm        age
## low       0.9279155 -0.9002730 -0.19513102 -0.8740789  0.4510628 -0.8867213
## med_low  -0.0913696 -0.2911617 -0.03844192 -0.5807849 -0.1223160 -0.3793710
## med_high -0.3836558  0.1931104  0.15646403  0.3930228  0.1140281  0.4230461
## high     -0.4872402  1.0149946 -0.07933396  1.0662955 -0.4222547  0.8161999
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8428080 -0.6789912 -0.7285408 -0.39200564  0.37567764 -0.77093834
## med_low   0.3597389 -0.5361295 -0.4615109 -0.03982856  0.31663833 -0.16048504
## med_high -0.3650188 -0.4542586 -0.3302709 -0.28998920  0.08434782 -0.06173942
## high     -0.8559422  1.6596029  1.5294129  0.80577843 -0.76539336  0.90035435
##                medv
## low       0.5324073
## med_low  -0.0140094
## med_high  0.1977447
## high     -0.7285395
## 
## Coefficients of linear discriminants:
##                 LD1          LD2        LD3
## zn       0.11143638  0.643005769 -0.9943789
## indus    0.08740548 -0.227273609  0.2772096
## chas     0.01766274 -0.058628846  0.1539449
## nox      0.18483474 -0.897262214 -1.3248989
## rm      -0.01332317 -0.033607221 -0.1228014
## age      0.25569474 -0.405630630 -0.1938632
## dis     -0.12730753 -0.346775059  0.0996160
## rad      4.07862012  0.908418329 -0.1089458
## tax      0.06001905  0.008407111  0.5321260
## ptratio  0.15504095  0.072786263 -0.3434405
## black   -0.10715634  0.034155460  0.1240672
## lstat    0.14772261 -0.131485744  0.4522005
## medv     0.04936251 -0.364822504 -0.2088124
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9637 0.0274 0.0088
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# The biplot visualizes the linear discriminants (LD1 and LD2) and shows
# how the observations in the training set are separated based on the
# categorical crime rate.
classes <- as.numeric(train_set$categorical_crim)

plot(lda_model, dimen = 2, col = classes,pch = classes)
lda.arrows(lda_model, myscale = 2)

LD1 is influenced mostly by rad; LD2 is influenced by zn and nox with similar magnitude but opposite direction.

4.1.6: Predicting crime categories on the test set

# Step 6: Save Crime Categories and Update Test Set
# Saving the crime categories from the test set
test_crime_categories <- test_set$categorical_crim

# Removing the categorical crime variable from the test set
test_set <- test_set[,-which(names(test_set) == "categorical_crim")]

# Step 7: LDA Model Prediction
# Fit LDA model on the training set
library(MASS) # LDA is in the MASS package
fit_lda <- lda(categorical_crim ~ ., data = train_set)

# Predicting on the test set
predictions <- predict(fit_lda, newdata = test_set)

Cross-tabulating the predictions against the actual categories with table() and addmargins():

table(Predicted = predictions$class, Actual = test_crime_categories) %>% 
  addmargins()
##           Actual
## Predicted  low med_low med_high high Sum
##   low       17       3        1    0  21
##   med_low    8      15        4    0  27
##   med_high   0       7       17    1  25
##   high       0       0        3   24  27
##   Sum       25      25       25   25 100

The model is reasonably good, especially at the high and low values, and can be used to differentiate high and low groups. The middle groups become muddled.
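The overall accuracy and misclassification rate follow directly from the confusion matrix:

```r
# Accuracy = correct predictions (the diagonal) over all predictions
conf <- table(Predicted = predictions$class, Actual = test_crime_categories)
sum(diag(conf)) / sum(conf)      # accuracy: 0.73 for the table above
1 - sum(diag(conf)) / sum(conf)  # misclassification (test) error
```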

4.1.7: K-means clustering

Reloading the dataset, scaling and calculating the distances:

data(Boston)

Boston_scaled <- data.frame(scale(Boston))

dist_euc <- dist(Boston_scaled)
summary(dist_euc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Mean distance is about 4.9. In order to determine the optimal number of clusters we can look at how the total within-cluster sum of squares (TWCSS) behaves as the number of clusters changes. We perform clustering with different values of k and observe what happens to the TWCSS as k increases: a good k is one beyond which the value stops dropping rapidly. But the larger the k, the harder the results may be to interpret. So we check and plot k from 1 to 20, just to see what happens.

k_max <- 20

twcs <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})

elbow_data <- data.frame(
  k = 1:k_max,
  twcs = twcs
)
elbow_plot <- ggplot(elbow_data, aes(x = k, y = twcs)) +
  geom_line() +  # Connect points with lines
  geom_point() +  # Add points
  scale_x_continuous(breaks = 1:k_max) +  # Ensure k values are treated as continuous
  labs(x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares", 
       title = "Determining Optimal k") +
  theme_minimal()

elbow_plot

The slope of the plot changes sharply at k = 2, so we can use this clustering for the further analysis.

Boston_scaled_km <- kmeans(Boston_scaled, 2)
pairs(Boston_scaled,col = Boston_scaled_km$cluster)

pairs(Boston_scaled[,c(1,10)],col = Boston_scaled_km$cluster)

There seems to be good cluster separation between crime and tax.

pairs(Boston_scaled[,c(3,5)],col = Boston_scaled_km$cluster)

An even better separation appears for indus and nox: one cluster with high indus and high nox, and another where both values are low.

pairs(Boston_scaled[,c(2,3)],col = Boston_scaled_km$cluster)

There also seems to be good separation for the indus and zn variables, with two clusters: low indus with high zn, and vice versa. As we saw earlier, the chosen variables are highly correlated, so these results make sense.

4.1.8*: Bonus. Visualising clustering results with a biplot:

The elbow plot shows that the WCSS keeps decreasing noticeably up to k = 6, so I decided to look at what k = 6 clustering looks like, redoing the LDA with the cluster assignments as the target classes.

set.seed(8) # For reproducibility

data("Boston")

Boston_scaled <- as.data.frame(scale(Boston))


Boston_scaled_km <- kmeans(Boston_scaled, centers = 6)


lda.fit <- lda(Boston_scaled_km$cluster ~ . , data = Boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(Boston_scaled_km$cluster)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 2)

As we can see, cluster 3 is separate from all the others, and the Charles River variable is the main determinant for this cluster. Cluster 5 is also separate from the others; this separation depends on the radial highways variable (rad).

4.1.9*: Bonus. Training data used to fit the LDA and its visualisation:

Installing/loading plotly

if (!require(plotly)) {
  install.packages("plotly")
}
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(plotly)
lda.fit <- lda(categorical_crim ~ . , data = train_set)


model_predictors <- dplyr::select(train_set, -categorical_crim)
# check the dimensions
dim(model_predictors)
## [1] 406  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = train_set$categorical_crim, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = factor(Boston_scaled_km$cluster[indexes]), type= 'scatter3d', mode='markers')

The crime colouring shows that the high crime group is clustered separately, together with some med-high observations. The k-means colouring shows clusters 3 and 4 together, as in the 2D plane; however, the third group is split between the two clusters.

4.2 Data wrangling for the next week’s data!

The R script transforming the data for next week's assignment is in the repository, with the human.csv file in the same directory.


5: Dimensionality reduction techniques

5.1 Data wrangling

The data wrangling was continued in this script. The variable descriptions are listed below; more detailed information can be found in the first and second tables at https://hdr.undp.org/data-center/documentation-and-downloads:

  • HDI_rank: “HDI Rank” in the first table, indicating the country’s global ranking based on the Human Development Index.
  • Country: The name of the country being evaluated, present in both tables.
  • HDI: “Human Development Index (HDI) Value” in the first table, measuring average achievement in key human development dimensions.
  • LifeExp_B: “Life Expectancy at Birth (years)” from the first table, showing the average number of years a newborn is expected to live under current mortality rates.
  • EduExp: “Expected Years of Schooling (years)” in the first table, estimating total years of schooling a child is expected to receive.
  • EduMean: “Mean Years of Schooling (years)” in the first table, representing the average years of education received by people aged 25 and older.
  • GNI_C: “Gross National Income (GNI) Per Capita (2017 PPP $)” from the first table, indicating the average income of a country’s citizens, adjusted to purchasing power parity.
  • GNI_C_HDIrank: “GNI Per Capita Rank Minus HDI Rank” from the first table, showing the difference between the country’s GNI per capita rank and its HDI rank.
  • GII_Rank: The ranking of the country in the Gender Inequality Index, part of the second table.
  • GII: “Gender Inequality Index” from the second table, measuring gender disparities in reproductive health, empowerment, and economic activity.
  • MMRatio: “Maternal Mortality Ratio” from the second table, indicating the number of maternal deaths per 100,000 live births.
  • AdBRate: “Adolescent Birth Rate” from the second table, referring to the number of births per 1,000 women aged 15-19.
  • RP_p: “Share of Seats in Parliament” in the second table, representing the percentage of parliamentary seats held by women.
  • F_2ED_p and M_2ED_p: The percentage of the female and male population, respectively, with at least some secondary education, as indicated in the second table.
  • F_LFPR and M_LFPR: The “Labor Force Participation Rate” for females and males, respectively, from the second table.
  • FM_2ED_Ratio: A metric comparing the ratio of females to males with at least some secondary education.
  • FM_LFPR_Ratio: The ratio of the female to the male labor force participation rate.

5.2 Analysis

Loading additional libraries

if (!require(FactoMineR)) {
  install.packages("FactoMineR")
}
## Loading required package: FactoMineR
library(FactoMineR)

5.2.1 Graphical overview and description of the data

human_data_ <- read.csv(file="/home/alex/ods_course/IODS-project_R/data/human.csv")
human_data_ <- column_to_rownames(human_data_,var ="Country") 

Graphical summary:

ggpairs(human_data_, progress = F)

1. All the distributions are skewed, meaning that extreme values are more common for some variables: life expectancy tends towards high values, GNI_C towards low values.
2. Many variables are strongly correlated, both positively and negatively:
  • Life expectancy, education levels and GNI are positively correlated with good significance, and are negatively correlated with the Maternal Mortality Ratio and the Adolescent Birth Rate with similar significance. This is to be expected.
  • Maternal mortality is correlated with adolescent births, which is also to be expected.
  • Higher secondary education among females is positively correlated with GNI and life expectancy.

These results are not surprising, as many of the indices are associated with overall wealth, even though there may be outliers.

Summary:

summary(human_data_)
##   FM_2ED_Ratio    FM_LFPR_Ratio        EduExp        LifeExp_B    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7254   1st Qu.:0.5981   1st Qu.:11.22   1st Qu.:66.42  
##  Median :0.9327   Median :0.7514   Median :13.45   Median :74.00  
##  Mean   :0.8499   Mean   :0.7034   Mean   :13.13   Mean   :71.58  
##  3rd Qu.:0.9958   3rd Qu.:0.8525   3rd Qu.:15.20   3rd Qu.:76.95  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##      GNI_C           MMRatio          AdBRate            RP_p      
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4495   1st Qu.:  13.0   1st Qu.: 13.57   1st Qu.:12.50  
##  Median : 12081   Median :  55.0   Median : 35.00   Median :19.30  
##  Mean   : 17344   Mean   : 150.3   Mean   : 47.35   Mean   :20.88  
##  3rd Qu.: 23112   3rd Qu.: 190.0   3rd Qu.: 71.78   3rd Qu.:27.55  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The data is not normalised, so the variables are not directly comparable.

5.2.2 Performing principal component analysis (PCA) on the raw (non-standardized) human data

We perform the PCA analysis for non-standardised data to see why it is important:

human_data_PCA_raw <- prcomp(human_data_)
summary(human_data_PCA_raw)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.821e+04 183.5476 24.84 11.23 3.696 1.539 0.1913 0.1568
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

Almost all of the variance is captured by the first component.
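The "Proportion of Variance" row reported by `summary()` is simply each component's squared standard deviation divided by their total:

```r
# Reproduce the proportion-of-variance row from the PCA standard deviations
vars <- human_data_PCA_raw$sdev^2
round(vars / sum(vars), 4)
```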

biplot(human_data_PCA_raw, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


GNI_C contributes most to PC1. That is because GNI_C has values up to 123124, far larger than those of any other variable. The nations with high GNI_C are on the left.

5.2.3 Standardising the variables in the human data and repeating the above analysis

As we saw earlier, the variables are on too dissimilar scales to use for PCA directly, hence the scaling:

human_data_scaled <- scale(human_data_)
summary(human_data_scaled)
##   FM_2ED_Ratio     FM_LFPR_Ratio         EduExp          LifeExp_B      
##  Min.   :-2.8446   Min.   :-2.5987   Min.   :-2.7620   Min.   :-2.7457  
##  1st Qu.:-0.5220   1st Qu.:-0.5286   1st Qu.:-0.6816   1st Qu.:-0.6272  
##  Median : 0.3476   Median : 0.2408   Median : 0.1131   Median : 0.2937  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6123   3rd Qu.: 0.7484   3rd Qu.: 0.7381   3rd Qu.: 0.6524  
##  Max.   : 2.7134   Max.   : 1.6796   Max.   : 2.5239   Max.   : 1.4487  
##      GNI_C            MMRatio           AdBRate             RP_p        
##  Min.   :-0.9206   Min.   :-0.7127   Min.   :-1.1509   Min.   :-1.8531  
##  1st Qu.:-0.7057   1st Qu.:-0.6554   1st Qu.:-0.8315   1st Qu.:-0.7435  
##  Median :-0.2891   Median :-0.4549   Median :-0.3041   Median :-0.1398  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3167   3rd Qu.: 0.1896   3rd Qu.: 0.6012   3rd Qu.: 0.5925  
##  Max.   : 5.8094   Max.   : 4.5338   Max.   : 3.8760   Max.   : 3.2512
human_data_PCA_scaled <- prcomp(human_data_scaled)
summary(human_data_PCA_scaled)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0705 1.1430 0.87523 0.77704 0.66349 0.53451 0.45617
## Proportion of Variance 0.5359 0.1633 0.09575 0.07547 0.05503 0.03571 0.02601
## Cumulative Proportion  0.5359 0.6992 0.79496 0.87043 0.92546 0.96117 0.98718
##                            PC8
## Standard deviation     0.32021
## Proportion of Variance 0.01282
## Cumulative Proportion  1.00000
human_data_PCA_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0705291 1.1430478 0.8752283 0.7770364 0.6634878 0.5345100 0.4561665
## [8] 0.3202124
## 
## Rotation (n x k) = (8 x 8):
##                       PC1         PC2          PC3         PC4        PC5
## FM_2ED_Ratio  -0.35624746  0.05544659 -0.258173727  0.61994816 -0.5923317
## FM_LFPR_Ratio  0.05140236  0.72511638 -0.574905230  0.04778188  0.2842048
## EduExp        -0.42844344  0.13794051 -0.068952381 -0.06994677  0.1577789
## LifeExp_B     -0.44425251 -0.02871318  0.112160198 -0.05130932  0.1525719
## GNI_C         -0.35074145  0.05123343 -0.198230209 -0.73273506 -0.4859385
## MMRatio        0.43708962  0.14765550 -0.123634905 -0.25658826 -0.1736624
## AdBRate        0.41035458  0.08931249  0.008796231  0.05375321 -0.4810245
## RP_p          -0.08404578  0.64720641  0.728586153  0.01511136 -0.1500635
##                       PC6          PC7         PC8
## FM_2ED_Ratio   0.19541082  0.057381277  0.16336822
## FM_LFPR_Ratio -0.03316792 -0.226716792 -0.07410761
## EduExp        -0.39597156  0.776784601 -0.05176554
## LifeExp_B     -0.42259100 -0.439295194  0.62590822
## GNI_C          0.11842241 -0.138379206 -0.16985528
## MMRatio        0.17148338  0.351654675  0.72304994
## AdBRate       -0.74968006 -0.078049728 -0.14549600
## RP_p           0.14102220  0.005543132 -0.02360083

PC1 is still the most important component, but now it explains only around 54% of the variance; PC2 captures around 16%, PC3 less than 10%, and the remaining components show diminishing returns.

names_for_the_biplot <- round(summary(human_data_PCA_scaled)$importance[2, ] * 100, digits = 2)
names_for_the_biplot <- paste0(names(names_for_the_biplot), " (", names_for_the_biplot, "%)")

biplot(human_data_PCA_scaled, choices = 1:2, xlab = names_for_the_biplot[1], ylab = names_for_the_biplot[2], cex = c(0.5, 0.7))

These results make a lot more sense and are in concordance with what I originally concluded from the first graphical analysis:

1. PC1 is negatively associated with GNI, life expectancy, education and the share of educated females, which are all positively correlated with each other, while increases in maternal mortality and adolescent births increase PC1.
2. PC2 increases with the share of female representatives in parliament and the female share of the labour force, which are correlated with each other.

As mentioned above, the differences arise because GNI has a much larger influence without scaling. With scaling we can actually see what is happening.

5.2.4 My interpretation of the data.

PC1 is related to what is usually described as the development level of a nation. Developed nations have high income and better education and life expectancy, whereas developing countries and failed states have high adolescent pregnancy rates and maternal mortality. It is widely believed that female education is the main driving factor behind declining fertility rates. The second PC shows that there is a connection between female representation in parliament and overall inclusion in the workforce, but it is not strictly tied to the "PC1 development level", meaning that in poorer countries women can also participate in government and the workforce. Conversely, in the petrostates of the Middle East, despite high GNI and associated parameters, women have lower representation.

5.2.5 Exploring the tea data and analysis

library(FactoMineR)
tea_data <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# View(tea_data)
str(tea_data)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Everything is a factor, apart from age. We can choose which variables to look at. It seems more interesting to focus on factual variables rather than emotional responses; the first 18 columns are the ones to choose from.

summary(tea_data)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
dim(tea_data)
## [1] 300  36
keep_tea_columns <- c("Tea", "How", "how", "sex", "frequency", "age_Q")

# select the 'keep_tea_columns' to create a new dataset
tea_time <- dplyr::select(tea_data, all_of(keep_tea_columns))

I chose these six variables to look at: the type of tea, how it is drunk, its form, and the respondent's sex, drinking frequency, and age group.

ggpairs(tea_time,progress = F) 

summary(tea_time)
##         Tea         How                      how      sex           frequency  
##  black    : 74   alone:195   tea bag           :170   F:178   +2/day     :127  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   M:122   1 to 2/week: 44  
##  green    : 33   milk : 63   unpackaged        : 36           1/day      : 95  
##                  other:  9                                    3 to 6/week: 34  
##                                                                                
##    age_Q   
##  +60  :38  
##  15-24:92  
##  25-34:69  
##  35-44:40  
##  45-59:61

People clearly prefer Earl Grey, and there are preferred ways of drinking tea: most respondents drink it plain, from a tea bag. Next, multiple correspondence analysis (MCA) is run on these categorical variables.

library(FactoMineR) # MCA() comes from the FactoMineR package
mca <- MCA(tea_time)

summary(mca)
## 
## Call:
## MCA(X = tea_time) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.262   0.241   0.210   0.196   0.194   0.184   0.174
## % of var.             10.460   9.638   8.416   7.858   7.744   7.375   6.950
## Cumulative % of var.  10.460  20.098  28.514  36.372  44.115  51.490  58.440
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.169   0.152   0.143   0.142   0.122   0.112   0.108
## % of var.              6.777   6.069   5.723   5.674   4.894   4.493   4.324
## Cumulative % of var.  65.217  71.287  77.010  82.683  87.578  92.070  96.395
##                       Dim.15
## Variance               0.090
## % of var.              3.605
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  |  0.522  0.347  0.113 |  0.206  0.059  0.018 | -0.471
## 2                  |  0.406  0.210  0.069 | -0.092  0.012  0.004 | -0.982
## 3                  | -0.227  0.066  0.040 | -0.383  0.203  0.112 | -0.064
## 4                  | -0.534  0.363  0.221 |  0.338  0.158  0.088 | -0.377
## 5                  | -0.029  0.001  0.001 |  0.058  0.005  0.002 | -0.140
## 6                  | -0.534  0.363  0.221 |  0.338  0.158  0.088 | -0.377
## 7                  |  0.113  0.016  0.004 |  0.120  0.020  0.005 |  0.134
## 8                  |  0.486  0.302  0.069 | -0.258  0.092  0.019 | -0.746
## 9                  |  0.165  0.035  0.010 | -0.072  0.007  0.002 | -0.257
## 10                 |  0.576  0.422  0.132 | -0.267  0.099  0.028 |  0.273
##                       ctr   cos2  
## 1                   0.352  0.092 |
## 2                   1.527  0.403 |
## 3                   0.007  0.003 |
## 4                   0.225  0.110 |
## 5                   0.031  0.014 |
## 6                   0.225  0.110 |
## 7                   0.028  0.006 |
## 8                   0.881  0.162 |
## 9                   0.105  0.025 |
## 10                  0.118  0.030 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   1.178  21.809   0.454  11.654 |  -0.433   3.192   0.061
## Earl Grey          |  -0.530  11.509   0.506 -12.304 |   0.021   0.020   0.001
## green              |   0.457   1.467   0.026   2.780 |   0.845   5.427   0.088
## alone              |  -0.198   1.620   0.073  -4.659 |   0.014   0.009   0.000
## lemon              |   0.169   0.200   0.004   1.027 |   0.272   0.563   0.009
## milk               |   0.251   0.841   0.017   2.235 |   0.136   0.270   0.005
## other              |   1.910   6.974   0.113   5.808 |  -2.256  10.559   0.157
## tea bag            |  -0.146   0.771   0.028  -2.889 |   0.007   0.002   0.000
## tea bag+unpackaged |  -0.194   0.749   0.017  -2.261 |  -0.198   0.849   0.018
## unpackaged         |   1.195  10.929   0.195   7.633 |   0.485   1.953   0.032
##                     v.test     Dim.3     ctr    cos2  v.test  
## black               -4.280 |  -0.072   0.102   0.002  -0.715 |
## Earl Grey            0.498 |  -0.004   0.001   0.000  -0.094 |
## green                5.134 |   0.186   0.301   0.004   1.129 |
## alone                0.330 |   0.328   5.524   0.199   7.718 |
## lemon                1.654 |   0.582   2.951   0.042   3.538 |
## milk                 1.216 |  -1.200  23.938   0.383 -10.694 |
## other               -6.859 |  -0.834   1.651   0.021  -2.535 |
## tea bag              0.133 |  -0.563  14.221   0.414 -11.130 |
## tea bag+unpackaged  -2.312 |   0.555   7.642   0.140   6.481 |
## unpackaged           3.097 |   1.209  13.897   0.199   7.720 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.546 0.125 0.005 |
## How                | 0.151 0.165 0.430 |
## how                | 0.195 0.041 0.451 |
## sex                | 0.089 0.407 0.011 |
## frequency          | 0.021 0.286 0.270 |
## age_Q              | 0.567 0.422 0.096 |
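Note that the first two dimensions together explain only about 20% of the total variance, so the biplots below should be read as rough tendencies rather than strong structure. As a sketch of how to inspect this directly (assuming, as the 300 × 36 dimensions and variable names suggest, that the survey is the `tea` dataset shipped with FactoMineR):

```r
library(FactoMineR)  # assumption: the survey is FactoMineR's built-in `tea` dataset

data(tea)
# Rebuild the same six-variable subset used above
tea_time <- tea[, c("Tea", "How", "how", "sex", "frequency", "age_Q")]
mca <- MCA(tea_time, graph = FALSE)

# Eigenvalue table: variance, % of variance, and cumulative % per dimension
head(mca$eig, 3)
```

The `mca$eig` matrix is what `summary(mca)` prints under "Eigenvalues"; its cumulative column makes it easy to decide how many dimensions are worth interpreting.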
plot(mca, graph.type="ggplot", invisible=c("var","quali.sup","quanti.sup"),cex=0.8)

plot(mca, graph.type="ggplot", invisible=c("ind"), cex=0.8, habillage="quali")

Looking at the plot, we can assume there is an association between age and the kind of tea a person prefers:

  • The 15-24 group prefers Earl Grey.
  • The oldest group prefers black tea.
  • The 25-34 group prefers green tea.
  • Drinking tea straight (alone) is associated with younger respondents; older respondents prefer tea with something added.
  • Middle age is associated with drinking green tea from tea bags daily.
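A simple way to sanity-check the age–tea association suggested by the MCA map is a plain cross-tabulation with a chi-squared test of independence (a sketch, again assuming the data is FactoMineR's built-in `tea` dataset):

```r
library(FactoMineR)  # assumption: the survey is FactoMineR's built-in `tea` dataset

data(tea)
# Counts of preferred tea type within each age group
tab <- table(tea$age_Q, tea$Tea)
print(tab)

# Chi-squared test: is tea preference independent of age group?
chisq.test(tab)
```

Since MCA maps only show proximities in a low-variance projection, a test like this helps confirm whether an apparent pattern is supported by the raw counts.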

6: Analysis of longitudinal data